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Abstract: Wc propose a method for incorporating variable selection into local 
polynomial regression. This can improve the accuracy of the regression by 
extending the bandwidth in directions corresponding to those variables judged 
to be are unimportant. It also increases our understanding of the dataset by 
highlighting areas where these variables arc redundant. The approach has the 
potential to effect complete variable removal as well as perform partial removal 
when a variable redundancy applies only to particular regions of the data. 
We define a nonparametric oracle property and show that this is more than 
satisfied by our approach under asymptotic analysis. The usefulness of the 
method is demonstrated through simulated and real data numerical examples. 



1. Introduction 

The classical regression problem is concerned with predicting a noisy continuous re- 
sponse using a d-dimcnsional predictor vector with support on some <i-dimensional 
subspacc. This functional relationship is often taken to be smooth and methods 
for estimating it range from parametric models, which specify the form of the rela- 
tionship between predictors and response, through to nonparametric models, which 
have fewer prior assumptions about the shape of the fit. An important consider- 
ation for fitting such a regression model is whether all d predictors are in fact 
necessary. If a particular predictor has no relationship to the response, the model 
be made both simpler and more accurate by removing it. In recent years there 
has been strong interest in techniques that automatically generate such "sparse" 
models. Most attention has been given to parametric forms, and in particular the 
linear model, where the response is assumed to vary linearly with the predictors. 
There has also been some investigation into variable selection for nonlinear models, 
notably through the use of smoothing splines and local regression. 

One common feature of the existing sparse methods is that the variable selection 
is "global" in nature, attempting to universally include or exclude a predictor. Such 
an approach does not naturally reconcile well with some nonparametric techniques, 
such as local polynomial regression, which focus on a "local" subset of the data to 
estimate the response. In this local context it would be more helpful to understand 
local variable influence, since predictors that are irrelevant in some regions may 
in fact be important elsewhere in the subspace. Just as in the global setting, such 
information would allow us to improve the accuracy and parsimony of a model, but 
at a local level. 
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However this approach to variable selection can be problematic. Most notably, 
variable significance affects the definition of "local". To illustrate concretely, sup- 
pose that two data points were close in every dimension except one. In typical local 
regression these points would not be considered close, and so the response at one 
point would not impact the other. If, however, we establish that the one predictor 
they differ by is not influential over a range that includes both these points, then 
they should actually be treated as neighbouring, and be treated as such in the 
model. Any methodology seeking to incorporate local variable influence needs to 
accommodate such potential situations. 

Understanding local variable significance can also give additional insight into a 
dataset. If a variable is not important in certain regions of the support, knowledge of 
this allows us to discount it in certain circumstances, simplifying our understanding 
of the problem. For example, if none of the variables are relevant in a region, we 
may treat the response as locally constant and so know that we can ignore predictor 
effects when an observation lies in this region. 

A final consideration is theoretical performance. In particular we shall present 
on approach that is "oracle"; that is, its performance is comparable to that of a 
particularly well-informed statistician, who has been provided in advance with the 
correct variables. It is interesting to note that variable interactions often cause 
sparse parametric approaches to fail to be oracle, but in the local nonparametric 
setting this is not an issue, because such interactions vanish as the neighbourhood 
of consideration shrinks. 

In this paper we propose a flexible and adaptive approach to local variable se- 
lection using local polynomial regression. The key technique is careful adjustment 
of the local regression bandwidths to allow for variable redundancy. The method 
has been named LABAVS, standing for "locally adaptive bandwidth and variable 
selection" . Section 2 will introduce the LABAVS algorithm, including a motivating 
example and possible variations. Section 3 will deal with theoretical properties and 
in particular establishes a result showing that the performance of LABAVS is better 
than oracle when the dimension remains fixed. Section 4 presents numerical results 
for both real and simulated data, showing that the algorithm can improve predic- 
tion accuracy and is also a useful tool in arriving at an intuitive understanding of 
the data. Technical details have been relegated to an appendix which may be found 
in the long version of this paper (Miller and Hall, 2010). 

LABAVS is perhaps best viewed as an improvement to local polynomial regres- 
sion, and will retain some of the advantages and disadvantages associated with 
this approach. In particular, it still suffers the "curse of dimensionality," in that it 
struggles to detect local patterns when the dimension of genuine variables increases 
beyond a few. It is not the first attempt at incorporating variable selection into local 
polynomial regression; the papers by Lafferty and Wasserman (2008) and Bertin 
and Lccue (2008) also do this. We compare our approach to these in some detail 
in Section 2.6. LABAVS can also be compared to other nonparametric techniques 
in use for low to moderate dimensions. These include generalised additive models, 
MARS and tree based methods (see Hastie et al., 2001). 

The earliest work on local polynomial regression dates back to that of Nadaraya 
(1964) and Watson (1964). General references on the subject include Wand and 
Jones (1995), Simonoff (1996) and Loader (1999). An adaptive approach to band- 
width selection may be found in Fan and Gijbels (1995), although this was not in 
the context of variable selection. Tibshirani (1996) studies the LASSO, one of the 
most popular sparse solutions for the linear model; more recent related work on the 
linear model includes that of Candes and Tao (2007) and Bickel et al. (2009). Zou 
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(2006) created the adaptive version of the LASSO and proved oracle performance 
for it. Lin and Zhang (2006) and Yuan and Lin (2006) have investigated sparse 
solutions to smoothing spline models. 

The LABAVS algorithm also bears some similarity to the approach adopted 
by Hall et al. (2004). There the aim was to estimate the conditional density of a 
response using the predictors. Cross-validation was employed and the bandwidths in 
irrelevant dimensions diverged, thereby greatly downweighting those components. 
In the present paper the focus is more explicitly on variable selection, as well as 
attempting to capture local variable dependencies. 



2. Model and Methodology 
2.1. Model and definitions 

Suppose that we have a continuous response Yi and a d-dimensional random pre- 
dictor vector Xi = (X- X \ . . . ,xf^) which has support on some subspace C C R d . 
Further, assume that the observation pairs (Y^JQ) are independent and identically 
distributed for i = 1, . . . ,n, and that Xi has density function /. The response is 
related to the predictors through a function g, 

(2.1) Y t = g{Xi) + e, , 

with the error having zero mean and fixed variance. Smoothness conditions for 
/ and g will be discussed in the theory section. 

Local polynomial regression makes use of a kernel and bandwidth to assign 
increased weight to neighbouring observations compared to those further away, 
which will often have zero weight. We take K{u) = IIi<j<d^*( M ^) to be the 
rf-dimensional rectangular kernel formed from a one dimensional kernel K* such as 
the tricubic kernel, 

K*(u^) = (35/32)(l - x 2 ) 3 I(|x| < 1) . 

Assume K* is symmetric with support on [—1,1]. For d x d bandwidth matrix H 
the kernel with bandwidth H, denoted Kh, is 

(2-2) K H (u) = j^K(H-^u). 

We assume that the bandwidth matrices are diagonal, H — diag(ft, 2 , . . . , h^), with 
each hj > 0, and write H{x) when H varies as a function of x. Asymmetric band- 
widths can be defined as having both a lower and an upper (diagonal) bandwidth 
matrix, H L and H u respectively, for a given estimation point x, rather than a 
single bandwidth H for all x. The kernel weight of an observation X, at estimation 
point x with asymmetrical bandwidth matrices H L {x) and H u (x), is 

-r-r 1 ( X U) - \ 

(2.3) K H u {x) , HHx) { Xl - X ) = [] ILM K 



hf(x)" V W.{x) 
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This amounts to having (possibly) different window sizes above and below x in each 
direction. Although such unbalanced bandwidths would often lead to undesirable 
bias properties in local regression, here they will be used principally to extend 
bandwidths in dimensions considered redundant, so this issue is not a concern. 

We also allow the possibility of infinite bandwidths hj =00. In calculating the 
kernel in (2.2) when hj is infinite, proceed as if the jth dimension did not exist (or 
equivalently, as if the jth factor in rectangular kernel product is always equal to 1). If 
all bandwidths arc infinite, consider the kernel weight to be 1 everywhere. Although 
the kernel and bandwidth conditions above have been defined fairly narrowly to 
promote simplicity in exposition, many of these assumptions are easily generalised. 

Local polynomial regression estimates of the response at point x, g(x), are found 
by fitting a polynomial q to the observed data, using the kernel and bandwidth 
to weight observations. This is usually done by minimising the weighted sum of 
squares, 



n 

(2.4) ~ l( X i ~ x )) 2K H{Xi - x) . 

1=1 



Once the minimisation has been performed, q(0) becomes the point estimate for 
g(x). The polynomial is of some fixed degree p, with larger values of p generally 
decreasing bias at the cost of increased variance. Of particular interest in the the- 
oretical section will be the local linear fit, which minimises 



( 2 -5) E 

»=i 



over 70 and 7 = (71, ... , -f d ). 



2.2. The LABAVS Algorithm 



Below is the LABAVS algorithm that will perform local variable selection and 
vary the bandwidths accordingly. The choice of H in the first step can be local 
or global and should be selected as for a traditional polynomial regression, using 
cross-validation, a plug-in estimator or some other standard technique. Methods 
for assessing variable significance in Step 2, and the degree of shrinkage needed in 
Step 4, are discussed below. 
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LABAVS Algorithm 

1. Find a starting d x d bandwidth H = diag(/i 2 , . . . , h 2 ). 

2. For each point x of a representative grid in the data support, perform local 
variable selection to determine disjoint index sets A + (x), A~(x), with A + (x) U 
A~(x) = {l,...,d}, for variables that are considered relevant and redundant 
respectively. 

3. For any given x, derive new local bandwidth matrices H L (x) and H u (x) by 
extending the bandwidth in each dimension indexed in A~(x). The resulting 
space given nonzero weight by the kernel K h l^ x ^ h u i x \{u — x) is the rectangle 
of maximal area with all grid points x inside the rectangle satisfying A + (x ) C 
A + (x). Here A + (x) is calculated explicitly as in Step 2, or taken as the set 
corresponding the closest grid point to x. 

4. Shrink the bandwidth slightly for those variables in A + {x) according to the 
amount that band widths have increased in the other variables. 

5. Compute the local polynomial estimator at x, excluding variables in A~(x) and 
using adjusted assymetrical bandwidths H L (x) and H u (x). The expression to 
be minimised is 

n 

^2{Y, - q(X t - x)} 2 K H L {x)tH u {x) (Xi - x) , 
i=l 

where the minimisation runs over all polynomials q of appropriate degree. The 
value of q(0) in the minimisation is the final local linear estimator. 

The key feature of the algorithm is that variable selection directly affects the 
bandwidth, increasing it in the direction of variables that have no influence on 
the point estimator. If a variable has no influence anywhere, it has the potential 
to be completely removed from the local regression, reducing the dimension of the 
problem. For variables that have no influence in certain areas, the algorithm achieves 
a partial dimension reduction. The increased bandwidths reduce the variance of the 
estimate and Step 4 swaps some of this reduction for a decrease in the bias to further 
improve the overall estimator. 

As a concrete example of the approach, define the following one-dimensional 
"huberised" linear function: 

(2.6) g(x) = .t 2 I(0 < x < 0.4) + (0.8a; - 0.16)1(2; > 0.4) , 

and let g(X) = g{{[X^]\ + [X^] 2 ^) 1 / 2 } for 2-dimcnsional random variable X = 
(X^\X^). Assume that X is uniformly distributed on the space [—2, 2] x [—2, 2]. 
Notice that when A^ 1 ), X^ < the response variable Y in (2.1) is independent 
of A^) and A< 2 ); when A^ < and A< 2 ) > the response depends on A< 2 > 
only; when X^ > and A^ 2 ) < the response depends on A^ only; when 
A^ 2 ' > the response depends on both A^) and A^ 2 '. Thus in each of 
these quadrants a different subset of the predictors is significant. A local approach 
to variable significance can capture these different dependencies, while a global 
variable redundancy test would not eliminate any variables. 

Now consider how the algorithm applies to this example, starting with a uniform 
initial bandwidth of h = 0.5 in both dimensions. Assuming that variable significance 
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Fig 1. Bandwidth adjustments under ideal circumstances in illustrative example. 



perfectly on a dense grid, figure 1 illustrates the adjusted bandwidths for each 
of the quadrants. The dots are four sample estimation points, the surrounding 
unit squares indicate the initial bandwidths and the dashed lines indicate how the 
bandwidths are modified. In the bottom left quadrant both variables are considered 
redundant, and so the bandwidth expands to cover the entire quadrant. This is 
optimal behaviour, since the true function is constant over this region, implying 
that the best estimator will be produced by including the whole area. In the bottom 
right quadrant the first dimension is significant while the second is not. Thus the 
bandwidth for the second dimension is "stretched", while the first is shrunken 
somewhat. Again, this is desirable for improving the estimator. The stretching in the 
second dimension improves the estimator by reducing the variance as more points 
are considered. Then the shrunken first dimension swaps some of this reduction in 
variance for decreased bias. Finally, in the top right quadrant, there is no change 
in the bandwidth since both variables are considered to be significant. 



2.3. Variable selection step 



Below are three possible ways to effect variable selection at x in Step 2 of the 
algorithm, presented in the context of local linear regression. They all make use 
of a tuning parameter A which controls how aggressive the model is in declaring 
variables as irrelevant. Cross validation can be used to select an appropriate level 
for A. So that the tuning parameters are comparable at different points in the data 
domain, it is useful to consider a local standardisation of the data at x . Define 
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( .) EtiX^KHjXj-xo) - T,7=iYiK H (Xi - xq) 

and define X, = (X^, . . . , xf ] ) and Y- by 

4 {EZ =1 (^ ) -^ ) ) a ^-*o)} 1/a ' 
£ - (y,-f xo )^(^-xo) 1/2 . 

Notice that X and Y incorporate the weight Kn(Xi — x ) into the expression. 

1. Hard thresholding: Choose parameters to minimise the weighted least 
squares expression, 

2 

(2.8) E^-A>-E*i w V 

i=l [ j=l 

and classify as redundant those variables for which \$j\ < A. This can be 
extended to higher degree polynomials, although performance tends to be 
more unstable. 

2. Backwards stepwise approach: For each individual j, calculate the per- 
centage increase in the sum of squares if the jth variable is excluded from the 
local fit. Explicitly, if q is the optimal local fit using all variables and qj is the 
fit using all except the jih, we classify the jth variable as redundant if 

E?=i \Yi - QjiXi)} 2 K H (Xi) - {y - qiXi)} 2 K H {X t ) 

(2.9) != >- > <A, 

Eti{Y-q(X)} K H (X t ) 

where Xi = Xi — xq. 

3. Local lasso: Minimise the expression 

n ( n ^ d 

(2-10) £ +A Etol" 

Those variables for which 7j are set to zero in this minimisation are then 
classed as redundant. While the normal lasso can have consistency problems 
(Zou, 2006), this local version does not since variables are asymptotically 
independent as h — > 0. The approach also scales naturally to higher order 
polynomials, provided all polynomial terms are locally standardised; a vari- 
able is considered redundant if all terms that include it have corresponding 
parameters set to zero by the lasso. 

We have found that the first and second of the above approaches have produced 
the most compelling numerical results. The numerical work in Section 4 uses the 
first approach for linear polynomials, while the theoretical work in Section 3 estab- 
lishes uniform consistency for both of the first two methods, guaranteeing oracle 
performance. 
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2-4- Variable shrinkage step 

The variable shrinkage step depends on whether the initial bandwidth, and thus 
the shrunken bandwidth h', is chosen locally or globally. Define 



(2- 11 ) V(x,H)= rv ^ lv 



where the bandwidth term in the function V is allowed to be asymmetrical. Then 
in the local case, letting d'(x) denote the cardinality of A(x), let 

(2.12) M(x) = V[x,{H L (x),H L (x)}]/V(x,H). 

The expression is asymptotically proportional to {h!(x)}~ d ^ and estimates the de- 
gree of variance stabilisation resulting from the bandwidth adjustment. Using this, 
the correct amount of bandwidth needed in step 4 is h'{x) = h{M(x)d'(x)/d} 1 ^ 4 . 
Since both sides of this expression depend on h'(x), shrinkage can be approximated 
in the following way. Let 

M*(x) = V[x, {H L (x),H L (x)}]/V(x, H) , 

where H L {x) and H u (x) are the bandwidth matrices immediately after step 3. 
Then the shrunken bandwidths are h'(x) = h{M*{x)d'{x)/d) l ^ d '^ +i \ 
In the global bandwidth case, we define 

(, 13) m [{ h W( x %H] . m^^pm . 

This expression measures the average variance stabilisation across the domain. In 
this case, the shrinkage factor should satisfy 



(2.14) h' = h (M[{H L {X), H U {X)}, H]E{d'(X)}/d) 



1/4 



The theoretical properties in Section 3 deal with the global bandwidth scenario. The 
treatment for the local case is similar, except that care must be taken in regions of 
the domain where the function g behaves in a way that is exactly estimable by a 
local polynomial and thus has potentially no bias. 



2.5. Further remarks 

1. The choice of distance between grid points in Step 2 is somewhat arbitrary, 
but should be taken as less than h so that all data points are considered 
in calculations. In the asymptotic theory we let this length decrease faster 
that the rate of the bandwidth, and in numerical experimentation the choice 
impacts only slightly on the results. 

2. Step 5 of the algorithm forces the estimate at point x to exclude variables 
indexed in A~(x). An alternative is to still use all variables in the final fit. 
This may be advantageous in situations with significant noise, where variable 
admission and omission is more likely to have errors. Despite including these 
extra variables, the adjusted bandwidths still ensure that estimation accuracy 
is increased. 
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3. Finding the maximal rectangle for each representative point, as suggested in 
step 3 of the algorithm, can be a fairly intensive computational task. In our 
numerical work we simplified this by expanding the rectangle equally until the 
boundary met a "bad" grid point (i.e. a point x' such that A + (x') <£. A + {x)). 
The corresponding direction was then held constant while the others continue 
to increase uniformly. We continued until each dimension stopped expanding 
or grew to be infinite. This approach does not invalidate the asymptotic results 
in Section 3, but there may be some deterioration in numerical performance 
associated with this simplification. 

4. If a variable is redundant everywhere, results in Section 3 demonstrate that 
the algorithm is consistent; the probability that the variable is classified as 
redundant everywhere tends to 1 as n grows. However, the exact probability 
is not easy to calculate and for fixed n we may want greater control over 
the ability to exclude a variable completely. In such circumstances a global 
variable selection approach may be appropriate. 

5. As noted at the start of Section 2.2, the initial bandwidth in Step 1 does 
not necessarily have to be fixed over the domain. For instance, a nearest 
neighbour bandwidth, where ft, at a; is roughly proportional to f(x) , could 
be used. Employing this approach offers many practical advantages and the 
theoretical basis is similar to that for the constant bandwidth. The numerical 
work makes use of nearest neighbour bandwidths throughout. In addition, we 
could use an initial bandwidth that was allowed to vary for each variable, 
H = diag(/i^, . . . , hp). So long as, asymptotically, each hj was equal to Cjh 
for some controlling bandwidth h and constant Cj, the theory would hold, 
although details are not pursued here. 

2.6. Comparison to other local variable selection approaches 

As mentioned in the introduction, two recent papers take a similar approach to this 
problem. Firstly Lafferty and Wasserman (2008) introduce the rodeo procedure. 
This attempts to assign adaptive bandwidths based on the derivative with respect 
to the bandwidth for each dimension, dg(x) / dhj. This has the attractive feature of 
bypassing the actual local shape and instead focussing on whether an estimate is 
improved by shrinking the bandwidths. It is also a greedy approach, starting with 
large bandwidths in each direction and shrinking only those that cause a change 
in the estimator at a point. The second paper is by Bertin and Lecue (2008), who 
implement a two step procedure to reduce the dimensionality of a local estimate. 
The first step fits a local linear estimate with an L\ or lasso type penalty, which 
identifies the relevant variables. This is followed by a second local linear fit using 
this reduced dimensionality. The lasso penalty they use is precisely the same as the 
third approach suggested in Section 2.3. 

We comment on the similarities and differences of these two approaches com- 
pared to the current presentation, which arc summarised in Table 1. Firstly the 
theoretical framework of the two other papers focus exclusively on the performance 
at a single point, while the LABAVS approach ensures uniformly oracle performance 
on the whole domain. The framework for the other two also assumes that variables 
are either active on the whole domain or redundant everywhere, while we have 
already discussed the usefulness of an approach that can adapt to variables that 
are redundant on various parts of the data. We believe this is particularly impor- 
tant, since local tests of variable significance will give the same results everywhere. 
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Related to this, our method does not require an assumption of nonzero gradients 
(whether with respect to the bandwidth or variables) to obtain adequate theoretical 
performance, in contrast to the other methods. On the other hand, ensuring both 
uniform performance while allowing d to be increasing is a quite challenging, so our 
presentation assumes d is fixed, while the others do not. It is also worth noting that 
the greedy approach of Lafferty and Wasserman potentially gives it an advantage 
in higher dimensional situations. 

While all approaches work in a similar framework, the above discussion demon- 
strates that there are significant differences. Our methodology may be viewed as 
a generalisation of the work of Bcrtin and Lecue, save for imposing fixed dimen- 
sionality. It can also be viewed as a competitor to the rodeo, and some numerical 
examples comparing the two are provided. 



Table 1 

Summary of locally adaptive bandwidth approaches 





LABAVS 


Rodeo 


Bertin and 
Lecue (2008) 


Oracle performance on entire domain 


✓ 


X 


X 


Allows for locally redundant variables 


✓ 


X 


X 


Relevant variables allowed to have zero gradient 


✓ 


X 


X 


Theory allows dimension d to increase with n 


X 


✓ 


✓ 


Greedy algorithm applicable for higher dimensions 


X 


✓ 


X 



With regards to computation time, for estimation at a single point the rodeo is 
substantially faster, since calculating variable significance on a large grid of points 
is not required. If however we need to make predictions at a reasonable number of 
points, then Labavs is likely to be more efficient, since the grid calculations need 
only be done once, while rodeo requires a new set of bandwidth calculations for 
each point. 

3. Theoretical properties 

As mentioned in the introduction, a useful means of establishing the power of a 
model that includes variable selection is to compare it with an oracle model, where 
the redundant variables are removed before the modelling is undertaken. In the 
linear (and the parametric) context, we interpret the oracle property as satisfying 
two conditions as n — > oo: 

1. the probability that the correct variables are selected converges to 1, and 

2. the nonzero parameters are estimated at the same asymptotic rate as they 
would be if the correct variables were known in advance. 

We wish to extend this notion of an oracle property to the nonparametric setting, 
where some predictors may be redundant. Here there are no parameters to estimate, 
so attention should instead be given to the error associated with estimating g. Below 
we define weak and strong forms of these oracle properties: 

Definition 1. The weak oracle property in nonparametric regression is: 

1. the probability that the correct variables are selected converges to 1, and 

2. at each point x the error of the estimator g{x) decreases at the same asymp- 
totic rate as it would if the correct variables were known in advance. 
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Definition 2. The strong oracle property in nonparametric regression is: 

1. the probability that the correct variables are selected converges to 1, and 

2. at each point x the error of the estimator g(x) has the same first-order asymp- 
totic properties as it would if the correct variables were known in advance. 

Observe that the weak oracle property achieves the correct rate of estimation 
while the strong version achieves both the correct rate and the same asymptotic 
distribution. The first definition is most analogous to its parametric counterpart, 
while the second is more ambitious in scope. 

Here we establish the strong version of the nonparametric oracle property for 
the LABAVS algorithm, with technical details found in the appendix (Miller and 
Hall, 2010). We shall restrict attention to the case of fixed dimension. To apply 
to a situation of increasing dimension, we could add an asymptotically consistent 
screening method to reduce it back to fixed d. The treatment here focuses on local 
linear polynomials, partly for convenience but also recognising that the linear factors 
dominate higher order terms in the asymptotic local fit. Thus our initial fit is found 
by minimising the expression (2.5). We impose further conditions on the kernel K: 
(3.1) 

/ K(z)dz = 1, / z^K(z)dz = for each j, J z^z^K{z)dz = when j ^ k and 
J(z^) 2 K(z)dz = fj, 2 (K) > 0, with fj, 2 (K) independent of j. 

The useful quantity R(K), depending on the choice of kernel, is defined as 

R(K) = J K{zfdz = ^J K*(z^) 2 dz {l) 

where K* is the univariate kernel introduced on page 3. Let a n x b n denote the 
property that a n — 0(b n ) and b n — 0{a n )- We also require the following conditions 
(3.2), needed to ensure uniform consistency of our estimators. 
(3.2) 

1. The support C = {x : f{x) > 0} of the random variable X is compact. 
Further, / and its first order partial derivatives are bounded and uni- 
formly continuous on C, and inf xe c f(x) > 0. 

2. The kernel function K is bounded with compact support and satisfies 
\p(u)K{u) — p(v)K(v)\ < C\\\u — v\\ for some C\ > and all points u, v 
in C. Here p(u) denotes a single polynomial term of the form rj(uW) ^ 
with the nonnegative integers aj satisfying a j ^ 4. The bound C\ 
should hold for all such choices of p. 

3. The function g has bounded and uniformly continuous partial derivatives 
up to order 2. If (D k g)(x) denotes the partial derivative 

d^g(x) 
d(xW) k i • • • d(xW) k * ' 

with |fc| = J^kj, then we assume that these derivatives satisfy, for some 
constant C 2 , 

\h{u) - h(v)\ < C 2 \\u - v\\ . 
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4. E(\Y\ a ) < oo for some a > 2. 

5. The conditional density fx\Y( x \y) 01 conditional on Y, exists and is 
bounded. 

6. 



n l-2/a h d 



For some < £ < 1, — wuv^l > 00 • 

log njlog n(log log n) 1+0 ) A i a 

7. The Hessian of g, T-L g , is nonzero on a set of nonzero measure in C. 



The conditions in (3.2), except perhaps the first, are fairly natural and not overly 
constrictive. For example, the sixth will occur naturally for any reasonable choice of 
h, while the second follows easily if K has a bounded derivative. The last condition 
is purely for convenience in the asymptotics; if H g was zero almost everywhere then 
g would be linear and there would be no bias in the estimate, improving accuracy. 
The first condition will not apply if the densities trail off to zero, rather than 
experiencing a sharp cutoff at the boundaries of C. However, in such circumstances 
our results apply to a subset of the entire domain, chosen so that the density did not 
fall below a specified minimum. Performance inside this region would then conform 
to the optimal accuracies presented, while estimation outside this region would 
be poorer. This distinction is unavoidable, since estimation in the tails is usually 
problematic and it would be unusual to guarantee uniformly good performance 
there. 

Step 1 of the LABAVS Algorithm allows the initial bandwidth to be chosen 
globally or locally. Here we shall focus on the global case, where an initial bandwidth 
H = diag(/i 2 , . . . , h 2 ) is used. Further, we assume that this h is chosen to minimise 
the mean integrated squared error (MISE): 



E 



J {g(x) - g(x)} 2 f(x)dx 



where the outer expectation runs over the estimator g. It is possible to show that 
under our assumptions that 



(3.3) 



da 2 R{K)A e 
nn 2 (K) 2 A H 



-l/(d+4) 



Notice in particular that h >; n 1 /^ 4 ). Details are given in Lemma A.l in the 
appendix (Miller and Hall, 2010). 

A key result in establishing good performance, in Theorem 3.1 below, is uniform 
consistency of the local polynomial parameter estimates. It is a simplified version 
of a result by Masry (1996), and no proof is included. 

Theorem 3.1. Suppose the conditions in (3.2) hold and we use parameter estimates 
from a degree p polynomial regression to estimate the partial derivatives of g. Then 
for each k with < |fc| < p we have 



su V \(D«g)(x)-(D k g)(x)\=0 



( lQ g 



n 



y nh d+2\k\ 



1/2 



0(h p ~ lkl+1 ) almost surely. 
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Since the partial derivative estimate at x is proportional to the corresponding 
local polynomial coefficient, Theorem 1 ensures that the local polynomial coeffi- 
cients are consistently estimated uniformly for suitable h. The scaling applied in 
(2.7) does not impact on this, as the proof of Theorem 3.2 demonstrates. 

Let C~ denote the points x € C satisfying dg(x) / dx^ — and d 2 g(x)/dx^ 2 ^ 
0. That is, C~ denotes the points where the true set of relevant variables changes. 
Notice that in the illustrative example in Section 2.2 we had C~ = {x | x^ — 
, x^ — 0}. The smoothness assumed of g implies that C~ has Lebesgue measure 
0. Let 5 > and let 0$ be an open set in C~ such that 

(3.4) inf Ji=S. 

xeC+\O s ,jeA+(x) 

Intuitively this means that on the set C \ Os the relevant variables have the abso- 
lute value of their corresponding parameters \^j\ bounded below by 5 > 0, while 
irrelevant variables have jj = 0. Thus we have a "gap" between the true and irrel- 
evant variables in this region that we may exploit. The volume of Os may be made 
arbitrarily small by choosing S small. Call the set A + {x) in the algorithm correct 
if the variables in it are the same as the set of variables j with dg(x) / dx^ ^ 0. 
Denote the latter correct set by A + (x). 

Theorem 3.2. Suppose 5 is given, h is chosen to minimise squared error as 
in (3.3), A + (x) is formed using the first approach in Section 2.3, and A has a 
growth rate between arbitrary constant multiples of h 2 {n log n) and hn 1 / 2 . If f 
has bounded and uniformly continuous derivatives of degree 2, then the probability 
thatA + (x) is correct onC\Os tends to 1 as n — > oo. Furthermore, variables that are 
genuinely redundant everywhere will be correctly classified as such with probability 
tending to 1. 

The property (3.4) ensures that the coefficients in the local linear fit are consis- 
tently estimated with error of order Oj/^logn) 1 / 2 }. The adjustment in (2.7) means 
that the actual coefficients estimated are of order hn 1 / 2 times this, so the range 
of A given is correct for separating true and redundant variables. The definition 
of Os ensures that the classification is correct on C\Os, while variables that are 
redundant everywhere will be recognised as such. 

The next result ensures consistency for the second approach in Section 2.3. We 
make one further assumption, concerning the error ej. Observe that this holds triv- 
ially if Ci is bounded. Assume that: 
(3.5) 

there exists C 3 such that \E(ef)\ < Cg~ 2 a 2 for a = 3,4, . . .. 

Theorem 3.3. Suppose 5 is given, h is chosen to minimise squared error as in 
(3.3), and A + (x) is formed using the second approach in Section 2.3. Provided that 
A = o(h 2 ) and h 4 log n — o(A), the probability that A + {x) is correct onC\Os tends 
to 1 as n — > oo. Furthermore, variables that are genuinely redundant everywhere 
will be correctly classified as such with probability tending to 1. 

The previous two results ensure that wc have consistent variable selection for the 
first two approaches in Section 2.3. Finally we can state and prove the strong oracle 
property for C\Og- Although the result does not cover the whole space C, recall 
that we may make the area Os arbitrarily small by decreasing S. Furthermore, the 
proof implies that if we restricted attention to removing only those variables that 
are redundant everywhere, we would actually have the oracle property on the whole 
of C; however we sacrifice this performance on Os to improve the fit elsewhere by 
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adjusting for locally redundant variables. In the following theorem the matrix H is 
the diagonal bandwidth matrix with bandwidth oo for globally redundant variables 
and h for the other variables, where 

/ \i/4 
h = h(M{H,H)d/d) 

Here d denotes the number of variables that are not globally redundant. 

Theorem 3.4. The estimates produced by the algorithm, where variable selection 
is performed using the first or second approach in Section 2.3, satisfy the strong 
definition of the nonparametric oracle property on C. Further, when there are lo- 
cally redundant variables, squared estimation error is actually less that the oracle 
performance by a factor of M[{H L (X), H u (X)}, H] < 1. 

4. Numerical properties 

The examples presented in this section compare the performance of two versions 
of the LABAVS algorithm with ordinary least squares, a traditional local linear 
fit, generalised additive models, tree-based gradient boosting and MARS. Table 2 
describes the approaches used. The implementations of the latter four methods 
were from the R packages locfit, gam, gbm and polspline respectively. Tuning pa- 
rameters such as bandwidths for local methods, A in LABAVS, number of trees in 
boosting, and MARS model complexity, were chosen to give best performance for 
each method. The LABAVS models used the first variable selection approach of Sec- 
tion 2.3. All the local methods used nearest neighbour bandwidths. The OLS linear 
model was included as a standard benchmark, but obviously will fail to adequately 
detect nonlinear features of a datasct. 

Table 2 

Approaches included in computational comparisons 



Name Description 

LABAVS-A LABAVS with linear fit, all vars in final fit 

LABAVS-B LABAVS with linear fit, relevant vars only in final fit 

LOCI Local linear regression 

OLS Ordinary least squares linear regression 

GBM Boosting with trees, depth equal to three 

GAM Generalised additive models with splines 

MARS Multivariate adaptive regression splines 



Example 1: The example introduced in Section 2.2 was simulated with n = 500. 
The error for Yj was normal with standard deviation 0.3. We first compare LABAVS 
to the rodeo and the methodology of Bertin and Lecue (2008) at the four represen- 
tative points in Figure 1. Table 3 shows the mean squared error of the prediction 
compared to the true value over 100 simulations. In all cases parameters were cho- 
sen to minimise this average error. At all points the LABAVS approach performed 
strongest. The method of Bertin and Lecue (2008) performed poorly in situations 
where at least one variable is redundant; this is to be expected, since it excludes the 
variable completely and so will incorporate regions where it is actually important, 
causing significant bias. The rodeo also did not perform as well; we found it tended 
to overestimate the optimal bandwidths in redundant directions. 

We then compared LABAVS with the other model approaches which are designed 
to make multiple predictions, rather than a specific point. For each simulation all 
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Fig 2. Plot of detected variable significance across subspace in Example 1. 

Table 3 

Mean squared prediction error on sample points in Example 1 

Test Point LABAVS-A LABAVS-B rodeo Bertin and 

Lecue 

(1,1) 0.0022 0.0022 0.0065 0.0023 

(1,-1) 0.0011 0.0013 0.0015 0.0018 

(-1,1) 0.0009 0.0011 0.0015 0.0013 

(-1,-1) 0.0006 0.0007 0.0008 0.0013 



the models were fitted and the average squared error was estimated using a separate 
test set of 500 observations. The simulation was run 100 times and the average error 
and its associated standard deviation for each model are recorded in Table 4. 

Inspection of the results shows that the LABAVS models performed best, able 
to allow for the different dependencies on the variables. In particular the algorithm 
improved on the performance of the local linear model on which it is based. The 
local linear regression, the boosted model and MARS also performed reasonably, 
while the GAM struggled with the nonadditive nature of the problem, and a strict 
linear model is clearly unsuitable here. 

To show how effectively variable selection is for LABAVS, Figure 2 graphically 
represents the sets A + at each grid point for one of the simulations, with the darkest 
representing {}, the next darkest {1}, the next darkest {2} and finally the lightest 
{1,2}. Here the variable selection has performed well; there is some encroachment 
of irrelevant variables into the wrong quadrants but the selection pattern is broadly 
correct. The encroachment is more prevalent near the boundaries since the band- 
widths are slightly larger there, to cover the same number of neighbouring points. 
Example 2: We next show that LABAVS can effectively remove redundant vari- 
ables completely. Retain the setup of Example 1, except that we add d* = d — 2 
variables similarly distributed (uniform on [-2,2]), which have no influence on the 



imsart-coll ver. 2009/08/13 file: miller-hall-labavs.tex date: June 18, 2010 



16 



H. Miller and P. Hall 



Table 4 

Mean squared error sum of test dataset in Example 1 



Approach 


Error 


Std Dcv 


LABAVS-A 


2.18 


(0.71) 


LABAVS-B 


1.87 


(0.65) 


LOCI 


2.31 


(0.73) 


OLS 


42.85 


(2.64) 


GBM 


2.47 


(0.67) 


GAM 


5.93 


(0.57) 


MARS 


2.35 


(0.90) 



response. Also, keep the parameters relating to the LABAVS fit the same as the 
previous example, except that the cutoff for hard threshold variable selection, A is 
permitted to vary. Table 5 shows the proportion of times from 500 simulations that 
LABAVS effected complete removal of the redundant dimensions, for various A and 
p* . Note that the cutoff level of 0.55 is that used in the previous example, and the 
two genuine variables were never completely removed in any of the simulations. The 
results suggest that to properly exclude redundant variables, a higher threshold is 
needed than would otherwise be the case. This causes the final model to be slightly 
undcrfit when compared to the oracle model, but this effect is not too severe; Fig- 
ure 3 shows how the variable significance plots change for a particular simulation 
with different values of the cutoff. It is clear that the patterns are still broadly cor- 
rect, and the results still represent a significant improvement to traditional linear 
regression. 

Table 5 

Proportion of simulations where redundant variables completely removed by LABAVS 

Number of redundant dimensions 
~\ 1 2 3 4 



0.55 0.394 0.086 0.034 0.038 
0.65 0.800 0.542 0.456 0.506 
0.75 0.952 0.892 0.874 0.864 




-2-1012 -2-101 



X (D X (D 

Fig 3. Plot of detected variable significance across subspace in Example 2, under various choices 
for A. 

Example 3: The first real data example used is the ozone dataset from Hastie et 
al. (2001), p. 175. It is the same as the air dataset in S-PLUS, up to a cube root 
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transformation in the response. The dataset contains meteorological measurements 
for New York collected from May to September 1973. There are 111 observations 
in the dataset, a fairly moderate size. Our aim here is to predict the ozone con- 
centration using two of the other variables, temperature and wind and scaled to 
unit variance when fitting the models. The smoothed perspective plot of the data 
in Figure 4 shows strong dependence on each of the two variables in some parts 
of the domain, but some sections appear flat in one or both directions in other 
parts. For example, the area surrounding a temperature of 70 and wind speed of 
15 appears to be flat, implying that for reasonably low winds and high temper- 
atures the ozone concentration is fairly stable. This suggests that LABAVS, by 
expanding the bandwidths here, could be potentially useful in reducing error. We 
performed a similar comparative analysis to that in Example 1, except that error 
rates were calculated using leave-one-out cross validation, where an estimate for 
each individual observations was made after using all other observations to build 
the model. The resulting mean squared errors and corresponding standard devia- 
tions are presented in Table 6. The results suggest that the data is best modelled 
using local linear methods, and that LABAVS offers a noticeable improvement over 
a traditional local fit, due to its ability to improve the estimate in the presence of 
redundant variables. The perspective plot in left panel of Figure 4 suggests a highly 
non-additive model, which may explain why GAM performs poorly. There is also a 
large amount of local curvature, which hinder the OLS, GBM and MARS fits. The 
right panel of Figure 4 shows the variable selection results for the linear version of 
LABAVS across the data support, using the same shading as in Figure 1. We see 
that variable dependence is fairly complex, with all combinations of variables being 
significant in different regions. In particular, notice that the procedure has labelled 
both variables redundant in the region around (70, 15), confirming our initial suspi- 
cions. This plot is also highly suggestive, revealing further interesting features. For 
instance, there is also little dependence on wind when temperatures are relatively 
high. Such observations are noteworthy and potentially useful. 

Table 6 

Cross-validated mean squared error sum for the ozone dataset 



Approach 


Error 


Std Dev 


LABAVS-A 


277 


(53) 


LABAVS-B 


284 


(55) 


LOCI 


290 


(55) 


OLS 


491 


(110) 


GBM 


403 


(118) 


GAM 


391 


(98) 


MARS 


457 


(115) 



Example 4: As a second low-dimensional real data example, we use the ethanol 
dataset which has been studied extensively, for example by Loader (1999). The 
response is the amount of a certain set of pollutants emitted by an engine, with 
two predictors: the compression ratio of the engine and the equivalence ratio of air 
to petrol. There are 88 observations, a fairly moderate size. Inspection of the data 
shows strong dependence on the equivalence ratio, but the case for the compression 
ratio is less clear. This suggests LABAVS could be potentially useful in reducing 
error. We performed a similar analysis to that in Example 1, with the results are 
presented in Table 7. 

The results in Table 7 show that this problem is particularly suited to MARS, 
which performed the best. After MARS, LABAVS produced the next strongest 
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Fig 4. Ozone dataset smoothed perspective plot and variable selection plot. 

Table 7 

Cross-validated mean squared error sum for the ethanol dataset 



Approach Error Std Dcv 



LABAVS-A 


0.075 


(0.011) 


LABAVS-B 


0.085 


(0.014) 


LOCI 


0.090 


(0.012) 


OLS 


1.348 


(0.128) 


GBM 


0.104 


(0.020) 


GAM 


0.098 


(0.012) 


MARS 


0.045 


(0.008) 



result, again improving on the traditional local linear model. The GBM and GAM 
models were inferior to the local linear fit. 
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Appendix A: Technical details 

We first prove the following lemma concerning the asymptotic behaviour of h. 

Lemma A.l. The choice of h that minimises the mean integrated squared error is 
asymptotically the minimiser of 

(A.l) {\/A)h^ 2 {KfA Hg + o-^nh^-'RWAc , 

where R{K) = J K(x) 2 dx for the function K, Au g = J tv{H g (x)} f(x)dx and 
Aq = J c Idx. Further, 



(A.2) h = 



da 2 R(K)A c i 1/{d+i) 



n^ 2 {K) 2 A Hg 



Proof: Ruppert and Wand (1994) show that for x in the interior of C we have bias 
and variance expressions 

E{g(x) - g(x)} = (l/2)^ 2 (K)h 2 ti{U g (x)} + o P {h 2 ) , and 

Var{g(x)} = { n - 1 h- d R(K)f(x)- 1 a 2 }{l + o P (l)} . 
Substituting these into the mean integrated squared error expression yields 

MISE = J E{g(x)-g(x)} 2 f(x)dx 

= J [E{g(x)} - g(x)} 2 + Vzv{g(x)}] f(x)dx 

= J(l/4)[i 2 (K) 2 h 4 tr{H g (x)} 2 f(x)dx + o P {h A ) 

+ J n- 1 h- d R{K)f(x)- 1 a 2 f(x)dx + o P (n- 1 h- d ) 

= {l/A)h^ 2 (K) 2 A Hg + ^(nh^RiKJAc + o P (h 4 + n^h^) . 

This establishes the first part of the Lemma. Notice that assumptions (3.1) and 
(3.2) ensure that the factors /i 2 (K) 2 A-u g and R(K)Aq are well defined and strictly 
positive. Elementary calculus minimising (A.l) with respect to h completes the 
Lemma. I 

Observe that we may express Yi using a first order Taylor expansion for g: 
g{x) + V g {x) T {X t -x) + e t + T{x) , 

where the remainder term is T(x) = J2j k e j,k(x)(X^^ — x^)(x\ k ^ —x^) with the 
terms e^\- are uniformly bounded. For local linear regression we aim to show that 
our local linear approximation 70 + A f T (Xi — x) is a good approximation for this 
expansion and that the remainder behaves. The following two results are needed 
before proving the Theorem 3.2 and Theorem 3.3. Firstly, the following version of 
Bernstein's Inequality may be found in Ibragimov and Linnik (1971), pl69. 
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Theorem A. 2 (Bernstein's Inequality). Suppose Ui are independent random vari- 
ables, let A 2 — Y^i=i Var([/j) and S n — Ym=\ Ui- Suppose further that for some 
L > we have 

\E[{Ui - E{U t )} k ]\ < \ VariUJL^kl . 

Then 

P{\S n - E(S n )\ > 2t^T 2 } < 2e-' 2 . 

Secondly, the following lemma contains a proof which is applicable to many 
uniform convergence type results. The structure is similar to that of Masry (1996), 
although it is simplified considerably when using independent observations and 
Bernstein's Inequality. In the proof, let C4 = sup/(x) < 00 and C5 = inf f(x) > 
for x eC. 

Lemma A.3. 

sup\n- 1 j: i e l K H (X l -x)\=o\(n- 1 h- d logn) 1/2 } . 

x£C 1 > 



Proof: Since is independent of Xi and E(ei) = 0, we have EfeKn (Xi — x)} = 0. 
As C is compact we may cover it with L(n) = (n/h d+2 logn) d / 2 cubes I\, . . . , Il(u), 
each with the same side length, proportional to L(n)~ x / d . Then 

sup \n~ 1 Y, i ^K H {X l - x)\ < 

x£C 

max sup \n~ 1 Y, e i K H{X i - x) - n^ 1 Y J ^iK H {X t - x m )\ 
m xecni m 

+ ma,x\n~ 1 J2tiKH(X i - x m )\ = Qi + Q 2 

m 

From the second condition of (3.2) we know that 

C e 

\eiK H (Xi - x) - €iK H (Xi - x m )\ < \\ h^(x - x m ) \\ 

Cfc fh d + 2 logn\ 1/2 _ nl / log 1/2 



" /,'•'•' { n ) ( "' : \ nh-< 

/j \ 1/2 

This expression is independent of x and m, and so Qi < C[ I -^f^ J p" 1 ^2 
which implies that Qi = 0[(^|^ 1 ) 1 / 2 ]. Now with regard to Q 2 , notice that 

(A.3) P(Q 2 > rj) < L(n) supP^n" 1 ^!^ (J*Q - x)\ > r,} . 

X 

Letting B 2 = sup x K(u) and using the first property in (3.5) we see that for k — 
3,4,..., 

\E[{eiK H (Xi-x)} a }\ < <j 2 C«- 2 J K H (u-x) a f(u)du 

< V a T{eiK H (Xi - x)}(B 2 C 3 ) a ~ 2 . 

Also, if S 3 = / K(u) 2 du wc can show that V&r{K H (X t - x)} < C 4 cr 2 B 3 h- d . We 
may let n be large enough so that 

(BlM ./.<v»33, 

z±s 2 u 3 
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for some B4 to be determined below. Then by Bernstein's inequality 



P< 



n- 1 £ e t K H { Xl -x)\>2{B A log n) ^ (°^^) " 
< p{|^e iW -,)|> 2(B 4 log n) 1 ^ £( e f K H [X t - x)*) 



< 2e~ Bi logn < 



Comparing this inequality to (A. 3) and choosing B4 large enough then the expres- 
sion 2L(n)n~ Bi is summable, by the Borel-Cantelli lemma we may conclude that 
Q2 = 0[(^f?) 1/2 ] and the lemma is proved. ■ 



In a similar fashion it is also possible to prove, letting Z^ = — x and ( = 
(n- 1 h- d log n) 1 / 2 , 



(A.4) supjn- 1 ^^^)- 
(A.5) supjn- 1 ^^) 2 - 

(A.6) supjn- 1 ^^^^)- 
(A.7) supjn- 1 ^^ '^^)- 
(A.8) supjn- 1 ^^^^^)- 
(A.9) sup, In- 1 e jk Z^z\ k } K H (Z t ) - 
(A.10) supJn-^e^zWzj 

-E{e jk Zl 



E{K H (Z t )}\ = 0(C) 

E{K H (Z^}\ = 0(h- d () 

E{Z^K H (Z t )}\ =0(h() 

E{eiZ^K H (Zi)}\ = 0(h() 

E{Z^zl k) K H (Zi)}\ = 0(h 2 () 

EiejkZ^Z^KniZ,)}] =0(h 2 () 
'K H (Zi) 

^^z^kmz^I =0(h 3 () 



Standard treatment of the expectation integrals reveals that 



(A.H) 


E{K H (Zi)} 


= f{x) + 0{h) 


(A.12) 


EiKuiZC) 2 } 


= h- d {f(x)R(K) + 0(h)} 


(A.13) 




= 0(h 2 ) 


(A.14) 


EiaZ^KxiZi)} 


= 


(A.15) 


E{Z^z\ k) K H {Z t )} 


= 0(tf) 


(A.16) 


E{e jk Z^Z^K H ( Zi )} 


= 0(h 2 ) 


(A.17) 


E{e jk Z^Z^Z^K H { Zi )} 


= 0{h±) 



Ifftxn V( d + 4 ) ; as it will be under Lemma A.l, then the asymptotic rates in 
the expectations (A.11)-(A.17) will dominate those of the deviations (A.4)-(A.10), 
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with the execption of (A. 14). We may then conclude that, uniformly on x, 



(A.18) 


n- 1 J2K H (Z t ) 

i 


= f(x) + 0(h) 


(A.19) 


i 


= h- d {f(x)R(K) + 0(h)} 


(A.20) 


n-^eiKniZi) 

i 


= 0(h) 


(A.21) 


n-^Z^ ] K H {Z t ) 

i 


= 0(h 2 ) 


(A.22) 


n-^dZ^KHiZi) 

i 


= 0(h 2 ) 


(A.23) 


n-^Z^Z^K^Z,) 

i 


= 0(h 2 ) 


(A.24) 


n-^e jk Z^Z^K H ( Zi ) 

i 


= 0(h 2 ) 


(A.25) 


n-^e jk Z^Z^Z^K H ( Zi ) 


= <W 4 ) 



Proof of Theorem 3.2: From Lemma A.l we know that an estimator of h that 
minimises mean integrated squared error will satisfy h x n _1 /( d+4 ^. Theorem 3.1 
then implies that 

sup \(Dlg)(x) - (Dig)(x)\ = O(h^\ogn) . 

xec,j=i,...,d 

Notice that the estimates jj at x in the minimisation (2.5) are exactly the estimates 
(Dig)(x). The adjusted parameter estimates $j in (2.8) therefore satisfy 

(A.26) ft = (Dr g )( x ){J2(X^ - xP)*K H {Xi z)} 1/2 ■ 

Let Pj = (D j g)(x){nh 2 fX2(K)f(x)} 1 ^ 2 . We aim to show that /3 converges to (3 
sufficiently fast uniformly in x. 

sup \$j-ft\ < sn V \{nh 2 ^(K)f(x)} 1 ' 2 {(bTg)(x)-(D^g)(x)}\ 

x£C,j x,j 

+ sup|(5r g )(x)[{E(Ap' ) - Xx ^) 2 K H (Xi - x)Y' 2 - {nh 2 ^(K)f(x)y/ 2 }\ 

x,j 

< 0(h 2 ^fn\ogn) 

(A.27) +A IS M V \{Y J (X\ 3) -X x 3) ) 2 K H (X t -x)Yl 2 - {nh 2 fi 2 (K)f(x)y/ 2 \ 

x,j 

In the first term of the last line we use the fact that (D^g) is bounded and (Dig) 
converges uniformly so may be bound be some constant A±, and for the second 
term we use the boundedness of f(x) and (A.26). 
Focusing on the first term, note that 

-pug - w - sup i " x) I - °"- 2 ' ■ 

x. j x,j K H\^i - X) 
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using (A.18) and (A.21). Thus 



Y J {X l -X^ fK H {X l -x) = Y.{ X i j) -^ j) +0{h 2 )} 2 K H {X i -x) 

i 

= 0(nh 4 ) + J2( X i J) - x^fK H {Xi - x) , 



again using (A.18) and (A.21). Now we consider the expectation of (X^— x^) 2 Kh (Xi 
x) carefully, 



E{{x\ j) -x^fK H {Xi-x)} = j -x^) 2 K H {u-x)f(u)du 

= h 2 j {z {j) ) 2 K(z)f(x + hz)dz 

= h 2 J {z^) 2 K(z){f(x) + hz T V f {x) + 0(h 2 )}dz 

= h 2 n 2 (K)f(x) + 0(h 4 ). 



The differentiability assumptions in the statement of the Theorem ensure that this 
formulation is uniform over all x in C. Using this and (A. 8) in (A. 27) and noting 
that if x -> then (1 + x) 1 / 2 - 1 = 0(x), we see that 



sup < A 1 ^\{nh 2 ^ 2 (K)f{x) + 0(nh 2 ^g~n))} 1 ' 2 ~ {nh 2 m (K)f(x)}^ 2 \ 

xec,j xj 

+ 0(h 2 ^n log n) 

= supA.lnh 2 ^).?^)} 1 / 2 ^ + O(^v^oi^)} 1 / 2 - 1| 

X 

+ 0{h 2 ^Jn logn) 

= sup A 1 {nh 2 ^ 2 (K)f(x)} 1 ^ 2 0(h 2 y^g~^) + 0{h 2 yJn\ogn) 

X 

= 0{h 2 ^n\ogn) 



We do not need to worry about small nonzero values of f)j by our assumption 
on Os, so the nonzero /3j grow at O^^h). Further, the estimate error of is 
0(h 2 ^fn logn) uniformly in x. A A that grows at some rate between these two 
as, suggested in the Theorem, will be able to separate the true variables from the 
redundant ones with probability tending to 1. ■ 



Proof of Theorem 3.3: Let fiQ, p, be the parameter estimates for the case where 
the jth variable is removed from consideration, so fi^ = 0. Theorem 3.1 ensures 
that the maximum distance for the estimators 70, /to from g(x) is O(C) and similarly 
7,/t converge to the derivative T> g (x) at 0{C,h~ x ), with the exception of fi^ = 0. 
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Thus we may expand the sum of squares difference and use results (A.18)-(A.25): 



SSj(x) — SS(x) 



= n 



1 J2i Y * - Ao - ^ T Z i \ 2 K H {Z i ) - n- 1 J^i^ ~%~ 1 T Z^K^) 



= n- 1 Y,K H {Z l ) 



{0(0 + e l+ V g {x)^z\» + T(x) + 0{(h-^ k Z\ k) } 



(k) X 2 



{OiO + ei+TW + OiCh-^Z^y 

= n- 1 J2 KHiZ.) [o(C 2 ) + eiO(C) + T(x)0(Q + O^OE*^ 

+ 0{QV g {x)^z\ j) + 2e l V g {x)^z\ 3) + 2T{x)V g {x)^ z\ j) 

+ OiCh-^V^Y^Z^Z^ + (V g (x)WZ^f + 0{Ch-^ k z\ k) 

+ oich-'mx^z^ + o{eh-^ k A k) z^ 

= 0(C 2 ) + (D s W ) 2 0(ft 2 ) . 



This shows the behaviour of the numerator in our expression. Note that our as- 
sumption on 0$ ensures that when \V g ^ | is nonzero it is bounded away from 0, so 
true separation is possible. In a similar fashion to that above we may expand and 
deal with the denominator nT 1 XX^i — 7o — 1 T Zi) 2 Kjj(Zi). The dominating term 
here is the asymptotic expectation of n _1 e iKii(Zi), which tends to a 2 f(x), and 
everything else converges to zero at h or faster, uniformly in x. Therefore, so as 
long as A shrinks faster than h 2 but slower than Q 2 — h 4 logn, the variable selection 
will be uniformly consistent. ■ 



Before proving Theorem 3.4, we prove the following three lemmas. The first 
allows us to separate out the effects of various variables in the LABAVS procedure. 
The latter two are concerned with the change in estimation error for local and 
global variable redundance respectively. 



Lemma A. 4. Let B\ and B 2 be disjoint subsets of {1, ... ,d} such that B\UB 2 = 
{1, ...,d}. The final estimates of the LABAVS procedure would be the same as 
applying the bandwidth adjustment, that is steps 3 and 4, in the procedure twice; 
the first time only expanding the bandwidths at x of those variables in A~ {x)f\B\ to 
the edges of the maximal rectangle and shrinking those remaining, and the second 
time expanding the variables in A~ (x) n B 2 and shrinking the variables in A + (x). 



Proof: Choose x E C. With some slight abuse of notation, since the bandwidths are 
possibly asymmetric, let Hi(x) denote the adjusted bandwidths after the first step 
of the two-step procedure, with shrunken variables having bandwidth hi . Similarly 
let H 2 {x) denote the bandwidths after the second step, with bandwidth on the 
shrunken variables h 2 . Further, let d\(x) equal the cardinality of A + (x) U B\ and 
d 2 (x) equal the cardinality of A + (x). The bandwidths for the redundant variables 
are expanded to the edges of the maximal rectangle, so we need only show that the 
resulting shrunken bandwidth is the same as when applying the one-step version of 
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the algorithm. Using expression (2.14) we know that 

'E{dx{X)}E{V{X, Hi)} 



hi = M(H 1 ,H) = h 
h 2 = M(H 2 ,H 1 )=h 1 



dE{V(X,H)} 
E{d 2 (X)}E{V(X,H 2 )} 



and 



Substituting the first expression into the second gives 

~E{d 2 {X)}E{V{X,H 2 )} 



h 2 = h 



dE{V(X, H)} 



which recovers the equation in the one-step bandwidth adjustment. Thus the band- 
widths arc unchanged for every x GC. ■ 

Lemma A. 5. Suppose that h is chosen to minimise squared error as in (3.3). Also, 
suppose that the LABAVS procedure identifies that no variables are globally redun- 
dant but some (possibly all) variables are locally redundant and that the local re- 
dundancy takes place on a set of non-zero measure. Then the LABAVS procedure re- 
duces the overall MISE of the estimation of g by a factor of M[{H L (X) , H u (X)} , H] < 
1. 

Proof: Wc shall ignore the difficulties associated with incorrect selection on 0$, as 
it only affects an arbitrarily small subset of the domain. With probability tending 
to one we have correct variable classification, so we work under this assumption. 
Since some variables are relevant in some regions, the choice of hi is well defined. 
Pick x € C and let u + denote the components of d- vector u indexed by A + (x) and 
u~ be the residual components. We can express the density f(x) as f(x + ,x~) so 
the relevant and redundant components may be treated separately. From (A. 18) 
and (A. 19) we know that 



V(x,H) = 



h- d {R(K)f(x) + 0(h)} 



[ AW 

I/O*) 



0(h) 



{f{x) + o{h)Y 

Taking an expectation over x we see that 

(A.28) E{V(X, H)} = h~ d R(K)Ac + 0(^ (d - 1) ) . 

For convenience let H*(x) denote the asymmetric bandwidths H L (x) and H u (x). 
We now show that the factor M{H*(X),H} is less than 1. Firstly observe that 
V(x, H) = V(x,H*(x)) whenever A + (x) = (l,...,d). Consider the case when 
A + (x) 7^ (1, . . . , d). In particular assume that k components are redundant at x. 
We see that 



E{K H *(Xi-x) 2 } = 



(h')-^- k) f(x+ + h'z+,u-) Q K*{z {l) ) 2 

jeA+(x) 



n 



h*Ax)' 2 K*{h*Ax)- 1 {u^ - x^)Y 



dz + d% 



= (h') 



-(d-k) 



R{K) 



(d-k)/d 



O(h') 



n h*{x)- 2 K*{h*{x)-\u 
jeA-(x) 

= (h')- {d ~ k) {Bi(x) + 0{h')}, 



0) 



,0) 



)} 2 f(x+,u-)du 
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where B\(x) is a uniformly bounded and strictly positive number depending only 
on x. An argument using Bernstein's Theorem similar to that in Lemma A. 3 shows 
that the uniform bound of n^ 1 Kh* (Xi — x) 2 away from E{Kh' (Xi — x) 2 } is 
0[(h , )- (d -^{n{h'y d - k ^-^ 2 ^/]og^], so we may deduce that 

n- 1 J2 Kh- (Xi - x) 2 = (h'y^iB^x) + 0(h')} . 

In a similar fashion we can show that 

n- 1 J2 K h* (Xi - x) = f x+ (x+)B 2 (x) + 0(ti) , 

where B 2 (x) is a uniformly bounded and strictly positive number depending only 
on x. This leads to 

(A.29) V{ ^H*) = (hr {d - k) {§^+0(h')^ . 

Let £ denote the event that A + (X) = (1, . . . , n) and £ c the complement. We know 
P(£ c ) > by assumption and also that given £ c is true, V(X, H*) = 0{(h')^ d -^} 
from (A.29). Thus as n — > oo, for some strictly positive constants B 3 , B 4 and B 5 , 

E{V(X,H*)} P(£)E{V(X,H*) \£} + P(£ C )E{V(X, H*) \£ c } 

E{V(X, H)} P{£)E{V(X,H) \£} + P(£ C )E{V(X, H) \£ c } 

h- d {Bz + Q(h)} + 0{{h')- (d ~ 1] } 
h- d {B 3 + 0(h)} + h- d {B i + 0(h)}' 

where £? 3 , B4 are constants satisfying B 3 > and B4 > 0. But from our definition 
of hi in (2.14) and the definition of M(H*(X),H), we may deduce that 

h'\^B, + 0{h d (h')- {d - 1) l +0{hy 



h ) B 3 + B 4 

From this expression it follows that both sides must be less than 1 in the limit. 
Thus we have M(H*,H) < 1 asymptotically, as required. 

Ruppert and Wand (1994) show that for a point x in the interior of C, using a 
bandwidth matrix H, Var(^(a;)|Xi, . . . , X n ) is equal to 

a 2 eJ(X T WXy 1 X T W 2 X(X T WX)- 1 e 1 , 

where X is the n x (p + 1) matrix (1,X), e\ is a p-vector with first entry 1 and 
the others 0, and W is an n x n diagonal matrix with entries Kn(Xi — x). This 
variance may be reexpressed as 

„2 J2^H(Xj-x) 2 TfyT v\-l vT YY fT v\-l 

a rv-^ (v \T2 e i \X X) X X(X X) ei. 

{l^ K H(Xi - x)Y 

Taking ratios of the expectations for the variance factors under the adjusted and 
initial bandwidths recovers the expression M(H*,H) in (2.12). Thus the variance 
term in the MISE is reduced by a factor of M(H* , H). Furthermore, the bias term 
in the MISE, in which we may ignore the zero bias contributed be the nth variable 
where it is redundant, is reduced by a factor of (h 1 /h) 4 which, from (2.14), is strictly 
less than the factor M(H*,H). Thus the MISE is reduced by the factor M(H*,H) 
as required. ■ 
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Lemma A. 6. Suppose that h is chosen to minimise squared error as in (3.3). Also, 
suppose that the LABAVS procedure finds that all variables are relevant everywhere 
in C except for a single variable which is globally irrelevant. Then the LABAVS 
procedure reduces the overall MISE of the estimation of g by a factor ofM(H*,H) < 
1. Furthermore the resulting bandwidth h' is asymptotically optimal, in the sense 
that it minimises the d — 1 dimensional MISE expression. 

Proof: Let C denote the d— 1 dimensional space formed by removing the irrelevant 
variable and denote the volume of this space by Ac- We know that our initial h 
satisfies (A. 2). By similar reasoning it follows that we are required to show that our 
adjusted bandwidth is asymptotically equal to 



(A.30) fcopt 



(d-l)a 2 R{K^ d -^' d Ac' 
nn 2 {K) 2 A H 



l/(d+3) 



which is the bandwidth the minimises MISE in the reduced dimension case. Here 
Ac denotes the volume of the d— 1 dimensional case. Equivalcntly, combining (A. 2) 
and (A.30), it is sufficient to show in the limit that 

(A.31) ^ +3 - 



h d +* dR(Ky/ d A c ' 
Arguments similar to those in the previous Lemma can be made to show 

n- 1 J2 Kh- {Xi - xf = (/i')- (d - 1) { J R(^) (d - 1)/d /x(-n) (x { ~ n) ) + 0(h')}, and 

n- 1 ]T K H , (X t - x) = {R{K)W d f x( - n) (*(-")) + 0(ti) . 

Thus 

E{V(X,H*)} = {h')- {d - 1) \R{Ki d - 1 ^ d J J /*<-») (* ( - n) )-7xC-»>(* ( - n) ) 

■f X M\ xl -n)(uW)duWdu<-- n ) + 0(/l')| 

= (h')- {d - 1) {Ac>R(K) (d - 1)/d + 0(h')} 
Combining this with (A. 28) and (2.14) gives 

Rearranging this last expression and letting n — > oo leads to the required expression 
(A.31). Note that (A.31) also implies that (ft,'//i) d+3 ft, _1 is asymptotically constant, 
so h'/h -> 0. This in turn implies that (ti/h) 4 = M(H*, H)(d - l)/d tends to zero 
so asymptotically M(H*,H) < 1 as required. The argument that the MISE is in 
fact reduced by the factor M(H*,H) is entirely analogous to the previous Lemma. 



Proof of Theorem 3.4: Correct variable selection at every point x G C with 
probability tending to 1 on the set C \Os for locally redundant variables, and C 
for globally redundant variables, is guaranteed by Theorem 3.2 or Theorem 3.3. 
For a given point x, repeated application of Lemma A. 4 allows us to consider the 
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eventually result by adjusting the bandwidths for any partition of variables in any 
order. Choose an order in which globally redundant variables are treated first, one at 
a time, followed by a final adjustment for those variables that are locally redundant. 
Lemma A. 6 ensures that when allowing for each globally redundant variable, the 
resulting bandwidths in the remaining variables is asymptotically optimal. This 
means that the strong nonparametric oracle property is satisfied after the global 
bandwidth adjustments. Lemma A. 5 provides the quantification of the additional 
benefit resulting from the local variable removal. ■ 
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